35 Rademacher Complexity
Rademacher complexity measures the complexity of a function class: if \mathcal{F} contains so many functions that some function in \mathcal{F} can almost always match the signs of randomly generated Rademacher random variables, then \mathcal{F} has a high Rademacher complexity.
Definitions
Definition 35.1 A Rademacher random variable X has a discrete probability distribution, taking the values +1 and -1 each with probability \frac{ 1 }{ 2 }.
The empirical Rademacher complexity measures the ability of the functions in a function class \mathcal{F} to fit random noise on a fixed sample \mathcal{S}; it is the maximum correlation, over all f \in \mathcal{F}, between the values f (z_{i}) and the noise variables \sigma_{i}.
Definition 35.2 Given an i.i.d. sample \mathcal{S} = \{ z_{1}, \dots, z_{n} \} from the distribution \mathbb{P}_{\mathcal{Z}^{n}} and n independent Rademacher random variables \sigma = \{ \sigma_{1}, \dots, \sigma_{n} \}, the empirical Rademacher complexity of a class of binary functions \mathcal{F} is defined as
\mathrm{Rad}_{\mathcal{S}} (\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} f (z_{i}) \right],
which is a function of the random sample \mathcal{S} and is therefore itself a random variable.
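To make the definition concrete, here is a minimal Monte Carlo sketch of \mathrm{Rad}_{\mathcal{S}} (\mathcal{F}) for a finite function class; the 1-D threshold classifiers, the Gaussian sample, and all names (empirical_rademacher, fvals) are illustrative assumptions rather than anything prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(fvals, n_draws=2000, rng=rng):
    """Monte Carlo estimate of Rad_S(F) for a finite class F.

    fvals: (|F|, n) array whose row k holds (f_k(z_1), ..., f_k(z_n)),
    the values of the k-th function on the fixed sample S.
    """
    n = fvals.shape[1]
    # Draw n_draws independent Rademacher vectors sigma in {-1, +1}^n.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For each draw, take sup_f (1/n) sum_i sigma_i f(z_i) over the finite class.
    sup_corr = (sigma @ fvals.T).max(axis=1) / n
    # Average over the sigma draws to approximate the expectation.
    return sup_corr.mean()

# Hypothetical finite class: 1-D threshold classifiers h_t(z) = sign(z - t),
# evaluated on a fixed sample of n = 20 points.
z = np.sort(rng.normal(size=20))
thresholds = np.linspace(-2.0, 2.0, 50)
fvals = np.sign(z[None, :] - thresholds[:, None])
print(empirical_rademacher(fvals))  # a value in (0, 1), well below 1
```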
Averaging over the draw of the sample, the Rademacher complexity of \mathcal{F} measures the expected noise-fitting ability of \mathcal{F} over all data sets \mathcal{S} \in \mathcal{Z}^{n} that could be drawn according to the distribution \mathbb{P}_{\mathcal{Z}^{n}}.
Definition 35.3 The Rademacher complexity is defined as the expectation of the empirical Rademacher complexity over all i.i.d. samples of size n
\mathrm{Rad}_{n} (\mathcal{F}) = \mathbb{E}_{\mathcal{S}} \left[ \mathrm{Rad}_{\mathcal{S}} (\mathcal{F}) \right] = \mathbb{E}_{\mathcal{S}} \left[ \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} f (z_{i}) \right] \right].
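Estimating \mathrm{Rad}_{n} (\mathcal{F}) then only adds an outer average over fresh samples; a sketch reusing empirical_rademacher and the hypothetical threshold class from the previous snippet:

```python
def rademacher_complexity(n, n_samples=200, rng=rng):
    """Monte Carlo estimate of Rad_n(F): average Rad_S(F) over fresh samples S."""
    vals = []
    for _ in range(n_samples):
        z = rng.normal(size=n)  # a fresh i.i.d. sample S of size n
        fvals = np.sign(z[None, :] - thresholds[:, None])
        vals.append(empirical_rademacher(fvals, rng=rng))
    return float(np.mean(vals))

print(rademacher_complexity(20))  # close to the single-sample estimate on average
```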
For completeness, we include the definition of the Rademacher average of a set of vectors.
Definition 35.4 Given n independent Rademacher random variables \sigma = \{ \sigma_{1}, \dots, \sigma_{n} \}, the Rademacher average of a set of vectors \mathcal{A} \subseteq \mathbb{R}^{n} is
\mathrm{Rad}_{\mathcal{A}} = \mathbb{E}_{\sigma} \left[ \sup_{\mathbf{a} \in \mathcal{A}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} a_{i} \right].
Rademacher complexity properties
Non-negativity
The empirical Rademacher complexity and Rademacher complexity are non-negative.
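This holds because the supremum dominates every fixed f \in \mathcal{F}:

\mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} f (z_{i}) \right] \geq \sup_{f \in \mathcal{F}} \mathbb{E}_{\sigma} \left[ \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} f (z_{i}) \right] = 0,

where the last equality uses \mathbb{E}_{\sigma} [\sigma_{i}] = 0. Taking the expectation over \mathcal{S} gives \mathrm{Rad}_{n} (\mathcal{F}) \geq 0 as well.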
Scaling and translation
Given any function class \mathcal{F} and constants a, b \in \mathbb{R}, denote the scaled and translated function class \mathcal{G} = \{ g \mid g (x) = a f (x) + b, f \in \mathcal{F} \}. Then
\mathrm{Rad}_{\mathcal{S}} (\mathcal{G}) = \lvert a \rvert \mathrm{Rad}_{\mathcal{S}} (\mathcal{F})
\mathrm{Rad}_{n} (\mathcal{G}) = \lvert a \rvert \mathrm{Rad}_{n} (\mathcal{F}).
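As a numerical sanity check, the Monte Carlo estimator sketched earlier can confirm the scaling property on the hypothetical threshold class; the offset b drops out because \mathbb{E}_{\sigma} [\sigma_{i}] = 0 and a constant shift does not affect the supremum.

```python
# Check Rad_S(G) = |a| Rad_S(F) for G = {a f + b : f in F}.
a, b = -3.0, 5.0
lhs = empirical_rademacher(a * fvals + b)
rhs = abs(a) * empirical_rademacher(fvals)
print(lhs, rhs)  # equal up to Monte Carlo error
```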
Symmetrization lemma
Here we prove an important result about the Rademacher complexity using the so-called symmetrization technique, which involves creating a “ghost” sample as a hypothetical independent copy of the original sample.
Theorem 35.1 For any class of measurable functions \mathcal{F}, the expected maximum error in estimating the mean of a function f \in \mathcal{F} is bounded by twice the Rademacher complexity
\mathbb{E}_{\mathcal{S}} \left[ \sup_{f \in \mathcal{F}} \left\lvert E_{\mathcal{S}} (f) - \mathbb{E}_{Z} [f (z)] \right\rvert \right] \leq 2 \mathrm{Rad}_{n} (\mathcal{F})
where E_{\mathcal{S}} (f) = \frac{ 1 }{ n } \sum_{i = 1}^{n} f (z_{i}) is the empirical estimate of the expectation of f on the sample \mathcal{S}.
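A sketch of the one-sided argument: introduce a ghost sample \mathcal{S}' = \{ z_{1}', \dots, z_{n}' \} drawn i.i.d. from the same distribution, write \mathbb{E}_{Z} [f (z)] = \mathbb{E}_{\mathcal{S}'} [E_{\mathcal{S}'} (f)], and note that \sigma_{i} (f (z_{i}') - f (z_{i})) has the same distribution as f (z_{i}') - f (z_{i}):

\mathbb{E}_{\mathcal{S}} \left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{Z} [f (z)] - E_{\mathcal{S}} (f) \right) \right] \leq \mathbb{E}_{\mathcal{S}, \mathcal{S}'} \left[ \sup_{f \in \mathcal{F}} \left( E_{\mathcal{S}'} (f) - E_{\mathcal{S}} (f) \right) \right] = \mathbb{E}_{\mathcal{S}, \mathcal{S}', \sigma} \left[ \sup_{f \in \mathcal{F}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} \left( f (z_{i}') - f (z_{i}) \right) \right] \leq 2 \mathrm{Rad}_{n} (\mathcal{F}),

where the first step is Jensen's inequality (the supremum of an expectation is at most the expectation of the supremum) and the last step splits the sum into its two halves. The absolute-value version in the theorem statement follows by an analogous argument.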
Rademacher-based uniform convergence
Theorem 35.2 Given a sample \mathcal{S} drawn i.i.d. from any distribution \mathbb{P}_{\mathcal{Z}^{n}}, if the function class \mathcal{F} contains only functions f with f (x) \in [a, a + 1], then with probability at least 1 - \delta, simultaneously for every f \in \mathcal{F}, the difference between the true expectation and the estimated expectation is no greater than the error \epsilon
\mathbb{P} \left( \forall f \in \mathcal{F}: \lvert \mathbb{E}_{\mathcal{Z}} [f (z)] - E_{\mathcal{S}} (f) \rvert \leq \epsilon \right) \geq 1 - \delta
where the error \epsilon is
\epsilon = 2 \mathrm{Rad}_{n} (\mathcal{F}) + \sqrt{\frac{ \log \frac{ 1 }{ \delta }}{ 2 n }}.
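For a sense of scale, one can plug assumed values into the bound (the values of \mathrm{Rad}_{n} (\mathcal{F}), n, and \delta below are purely illustrative):

```python
import math

rad_n, n, delta = 0.05, 2000, 0.05  # assumed values for illustration
eps = 2 * rad_n + math.sqrt(math.log(1 / delta) / (2 * n))
print(eps)  # ~0.127: here the complexity term dominates the confidence term
```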
Results for risks
Given a hypothesis class \mathcal{H}, a corresponding loss class with the 0-1 loss can be defined as
\mathcal{L} = \{ l_{h} \mid l_{h} (z) = L (h (\mathbf{x}), y), h \in \mathcal{H} \}, where z = (\mathbf{x}, y) \in \mathcal{Z},
and therefore the empirical risk and true risk can be defined as
R_{\mathcal{S}} (h) = E_{\mathcal{S}} (l_{h}), \quad R (h) = \mathbb{E}_{\mathcal{Z}} [l_{h} (z)].
Since every loss function l_{h} \in \mathcal{L} has range contained in [0, 1], we can apply the uniform convergence theorem above to the loss class \mathcal{L} to derive uniform convergence results for risks.
Corollary 35.1 Given a sample \mathcal{S} drawn i.i.d. from any distribution \mathbb{P}_{\mathcal{Z}^{n}}, a hypothesis class \mathcal{H}, and the corresponding 0-1 loss class \mathcal{L}, with probability at least 1 - \delta, simultaneously for every hypothesis h \in \mathcal{H}, the difference between the true risk and the empirical risk is no greater than the error \epsilon
\mathbb{P} \left( \forall h \in \mathcal{H}: \lvert R_{\mathcal{S}} (h) - R (h) \rvert \leq \epsilon \right) \geq 1 - \delta
where the error \epsilon is
\epsilon = 2 \mathrm{Rad}_{n} (\mathcal{L}) + \sqrt{\frac{ \log \frac{ 1 }{ \delta }}{ 2 n }}.
By using the following lemma, we can state the result in terms of the Rademacher complexity of the hypothesis class \mathcal{H} instead of that of the loss class \mathcal{L}.
Lemma 35.1 Given a hypothesis class \mathcal{H} and the corresponding loss class \mathcal{L}, we have
\mathrm{Rad}_{n} (\mathcal{H}) = 2 \mathrm{Rad}_{n} (\mathcal{L}).
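A sketch of why this holds, assuming binary labels y \in \{ -1, +1 \} and hypotheses h (\mathbf{x}) \in \{ -1, +1 \}, so that the 0-1 loss can be written l_{h} (z) = \frac{ 1 - y h (\mathbf{x}) }{ 2 }:

\mathrm{Rad}_{\mathcal{S}} (\mathcal{L}) = \mathbb{E}_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{ 1 }{ n } \sum_{i = 1}^{n} \sigma_{i} \frac{ 1 - y_{i} h (\mathbf{x}_{i}) }{ 2 } \right] = \frac{ 1 }{ 2 } \mathbb{E}_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{ 1 }{ n } \sum_{i = 1}^{n} (- \sigma_{i} y_{i}) h (\mathbf{x}_{i}) \right] = \frac{ 1 }{ 2 } \mathrm{Rad}_{\mathcal{S}} (\mathcal{H}),

since the term \frac{ 1 }{ 2 n } \sum_{i} \sigma_{i} does not depend on h and has zero expectation, and \sigma_{i}' = - \sigma_{i} y_{i} is again a Rademacher variable. Taking expectations over \mathcal{S} gives the lemma.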
Corollary 35.2 Given a sample \mathcal{S} drawn i.i.d. from any distribution \mathbb{P}_{\mathcal{Z}^{n}} and a hypothesis class \mathcal{H}, with probability at least 1 - \delta, simultaneously for every hypothesis h \in \mathcal{H}, the difference between the true risk and the empirical risk is no greater than the error \epsilon
\mathbb{P} \left( \forall h \in \mathcal{H}: \lvert R_{\mathcal{S}} (h) - R (h) \rvert \leq \epsilon \right) \geq 1 - \delta
where, by Lemma 35.1, 2 \mathrm{Rad}_{n} (\mathcal{L}) = \mathrm{Rad}_{n} (\mathcal{H}), so the error \epsilon is
\epsilon = \mathrm{Rad}_{n} (\mathcal{H}) + \sqrt{\frac{ \log \frac{ 1 }{ \delta }}{ 2 n }}.
Bounding Rademacher complexity
The Rademacher complexity can be upper bounded for any function class with finite VC dimension.
Massart’s lemma
Lemma 35.2 (Massart’s lemma) For any finite set of vectors \mathcal{A} \subseteq \mathbb{R}^{n}, the Rademacher average is upper bounded as
\mathrm{Rad}_{\mathcal{A}} \leq \frac{ R \sqrt{2 \log \lvert \mathcal{A} \rvert} }{ n }
where R = \sup_{\mathbf{a} \in \mathcal{A}} \lVert \mathbf{a} \rVert_{2}.
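A quick numerical check of the lemma, reusing the hypothetical empirical_rademacher sketch from earlier (the rows of the matrix play the role of the vectors \mathbf{a} \in \mathcal{A}):

```python
# Compare the Monte Carlo Rademacher average of a random finite set of
# vectors with Massart's bound R * sqrt(2 log|A|) / n.
A = rng.normal(size=(30, 50))        # |A| = 30 vectors in R^50
n = A.shape[1]
R = np.linalg.norm(A, axis=1).max()  # R = sup_a ||a||_2
mc = empirical_rademacher(A)
bound = R * np.sqrt(2 * np.log(A.shape[0])) / n
print(mc <= bound, mc, bound)        # True, typically with some slack
```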
Connection with VC theory
Lemma 35.3 The Rademacher complexity of a hypothesis class \mathcal{H} with finite VC dimension d is upper bounded as
\mathrm{Rad}_{n} (\mathcal{H}) \leq \sqrt{\frac{ 2 d \log \frac{ e n }{ d } }{ n }}.
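To see how this bound decays with the sample size, one can evaluate it for an assumed VC dimension, say d = 10:

```python
import math

d = 10  # assumed VC dimension for illustration
for n in (100, 1_000, 10_000, 100_000):
    print(n, math.sqrt(2 * d * math.log(math.e * n / d) / n))
# The bound shrinks roughly like sqrt(log(n) / n): ~0.81, 0.33, 0.13, 0.05.
```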